Lab Assignment 2 for CSE 7324 Fall 2017

Members: Hongning Yu, Hui Jiang, Hao Pan

1. Business Understanding

The dataset we use is a lyrics dataset (lyrics from MetroLyrics), which can be downloaded from Kaggle for free: https://www.kaggle.com/gyani95/380000-lyrics-from-metrolyrics. By exploring this dataset, we can identify the key features of each song genre and predict the genre of new songs.

The dataset contains 362,237 records and 5 features (song name, year, artist, genre, and lyrics). It is composed entirely of text divided into documents, and since we can predict song genres from lyrics, it meets the requirements for Lab 2.

For this project, our main purpose is to find the distinguishing features of different song genres by analyzing the most frequent words in lyrics. Visualizing these features will reveal more information about the dataset, and may help us figure out relationships among features that benefit genre prediction as well.

The statistics and prediction results can be applied to applications for song searching or recommendation. For example, a song-search application (like the one Siri uses when you ask "What song is this?") could narrow its search scope by classifying songs according to lyric features. A song-recommendation application could suggest new songs by analyzing the lyrics of a user's favorite songs.

To ensure prediction quality, we will set an accuracy target (e.g., 80%) and measure it with standard accuracy functions. We will add other, more informative evaluation metrics, such as AUC, if needed.
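As a minimal sketch of the evaluation we have in mind, the accuracy measurement could look like the following; the labels and predictions here are hypothetical placeholders, not real results:

```python
from sklearn.metrics import accuracy_score

# Hypothetical genre labels and classifier predictions, for illustration only.
y_true = ['Rock', 'Pop', 'Rock', 'Jazz']
y_pred = ['Rock', 'Rock', 'Rock', 'Jazz']

# Fraction of predictions that match the true labels.
acc = accuracy_score(y_true, y_pred)
print('Accuracy: {:.2f}'.format(acc))  # 3 of 4 correct -> 0.75
```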

2. Data Encoding

First, let's load the data into a DataFrame. The data is already in a CSV file, but all of the lyrics are raw text in varying formats. Our goal is to predict genre based on lyrics, so we still need to clean all the lyrics.

In [1]:
import pandas as pd
import nltk
import numpy as np
import string

pd.set_option('display.max_columns', 60)
In [2]:
df = pd.read_csv("./lyrics.csv", encoding="utf-8")
df.head()
Out[2]:
index song year artist genre lyrics
0 0 ego-remix 2009 beyonce-knowles Pop Oh baby, how you doing?\nYou know I'm gonna cu...
1 1 then-tell-me 2009 beyonce-knowles Pop playin' everything so easy,\nit's like you see...
2 2 honesty 2009 beyonce-knowles Pop If you search\nFor tenderness\nIt isn't hard t...
3 3 you-are-my-rock 2009 beyonce-knowles Pop Oh oh oh I, oh oh oh I\n[Verse 1:]\nIf I wrote...
4 4 black-culture 2009 beyonce-knowles Pop Party the people, the people the party it's po...

Check for null values in the dataset.

In [3]:
df.isnull().sum()
Out[3]:
index         0
song          2
year          0
artist        0
genre         0
lyrics    95680
dtype: int64

It looks like there are null values in song and lyrics; just drop them.

In [4]:
df.dropna(inplace=True)
df.isnull().sum()
Out[4]:
index     0
song      0
year      0
artist    0
genre     0
lyrics    0
dtype: int64

Check the genre distribution.

In [5]:
df.genre.value_counts()
Out[5]:
Rock             109235
Pop               40466
Hip-Hop           24850
Not Available     23941
Metal             23759
Country           14387
Jazz               7970
Electronic         7966
Other              5189
R&B                3401
Indie              3149
Folk               2243
Name: genre, dtype: int64

As we can see, some genres have far more records than others. For our genre-predicting classification problem, we could sample the dataset and take subsets of the larger genres to avoid bias. But let's keep it as it is for now and deal with this later.

Check the overall DataFrame info:

In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 266556 entries, 0 to 362236
Data columns (total 6 columns):
index     266556 non-null int64
song      266556 non-null object
year      266556 non-null int64
artist    266556 non-null object
genre     266556 non-null object
lyrics    266556 non-null object
dtypes: int64(2), object(4)
memory usage: 14.2+ MB

2.1 Read in data and check data quality

Change to ASCII

First, let's try to get rid of all non-ASCII characters, since we only want English text.

This row-by-row approach takes too much time, so it is commented out.

In [7]:
# %%time
# import re
# for row in df.index[:1000]:
#     df.loc[row, 'lyrics'] = df.loc[row, 'lyrics'].encode('ascii', errors='ignore').decode()

# for row in df.index[:1000]:
#     df.loc[row, 'lyrics'] = re.sub(r'[^\x00-\x7f]',
#                                    r'', 
#                                    df.loc[row, 'lyrics']) 
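The row-by-row loop above could be replaced with pandas' vectorized string methods, which avoid per-row `.loc` assignments. A minimal sketch on a tiny stand-in frame (in the notebook, `df` would be the full dataset):

```python
import pandas as pd

# Tiny stand-in frame; in the notebook this would be the full df.
df = pd.DataFrame({'lyrics': ['héllo\nwörld', 'plain ascii']})

# Vectorized strip of non-ASCII characters, replacing the row loop.
df['lyrics'] = (df['lyrics']
                .str.encode('ascii', errors='ignore')
                .str.decode('ascii'))
print(df['lyrics'].tolist())  # ['hllo\nwrld', 'plain ascii']
```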

English Filter

We want to focus on songs with English lyrics, so let's delete all non-English records if they exist.

I tried to build an English-ratio detector to eliminate all non-English songs. Reference: https://github.com/rasbt/musicmood/blob/master/code/collect_data/data_collection.ipynb

But the per-row set computation in the loop takes too much time and needs improvement.

In [8]:
# %%time
# def eng_ratio(text):
#     ''' Returns the ratio of non-English to English words from a text '''

#     english_vocab = set(w.lower() for w in nltk.corpus.words.words()) 
#     text_vocab = set(w.lower() for w in text.split('-') if w.lower().isalpha()) 
#     unusual = text_vocab.difference(english_vocab)
#     diff = len(unusual)/(len(text_vocab)+1)
#     return diff

    
# # first let's eliminate non-english songs by their names
# before = df.shape[0]
# for row_id in range(100):
#     text = df.loc[row_id]['song']
#     diff = eng_ratio(text)
#     if diff >= 0.5:
#         df = df[df.index != row_id]
# after = df.shape[0]
# rem = before - after
# print('%s have been removed.' %rem)
# print('%s songs remain in the dataset.' %after)
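One likely cause of the slowness is that `english_vocab` is rebuilt inside `eng_ratio` on every call; hoisting it out should make the loop roughly linear. A sketch of the restructured function, using a tiny stand-in vocabulary (the notebook would build it once from `nltk.corpus.words.words()`):

```python
def eng_ratio(text, english_vocab):
    """Ratio of out-of-vocabulary words to all words in a song title."""
    text_vocab = set(w.lower() for w in text.split('-') if w.lower().isalpha())
    unusual = text_vocab.difference(english_vocab)
    return len(unusual) / (len(text_vocab) + 1)

# Tiny stand-in vocabulary; build the real one ONCE, outside the loop.
vocab = {'love', 'me', 'do', 'tender'}

print(eng_ratio('love-me-do', vocab))      # 0 unusual / (3 + 1) -> 0.0
print(eng_ratio('besame-mucho', vocab))    # 2 unusual / (2 + 1) -> ~0.67
```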

English Filter Ver.2

This is another approach, which uses the package from https://github.com/saffsd/langid.py. It detects language fairly quickly, but 260k records still take around 50 minutes.

In [9]:
# # package from https://github.com/saffsd/langid.py
# import langid

# before = df.shape[0]
# for row in df.index:
#     lang = langid.classify(df.loc[row]['lyrics'])[0]
#     if lang != 'en':
#         df = df[df.index != row]
# after = df.shape[0]

# rem = before - after
# print('%s have been removed.' %rem)
# print('%s songs remain in the dataset.' %after)
23693 have been removed.
242863 songs remain in the dataset.
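Part of that cost is that `df = df[df.index != row]` rebuilds the frame on every removal. Building one boolean mask and filtering once should be much cheaper. A sketch, where `classify_lang` is a stand-in for `langid.classify`:

```python
import pandas as pd

def classify_lang(text):
    """Stand-in for langid.classify: flags text containing 'ñ' as Spanish."""
    return ('es', 0.9) if 'ñ' in text else ('en', 0.9)

df = pd.DataFrame({'lyrics': ['hello world', 'mañana amor', 'good day']})

# Build one boolean mask, then filter in a single step.
mask = df['lyrics'].apply(lambda t: classify_lang(t)[0] == 'en')
df = df[mask]
print(df['lyrics'].tolist())  # ['hello world', 'good day']
```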

Save the English songs to a new CSV.

In [10]:
# df.to_csv('lyrics_new.csv',index_label='index')

Re-read the CSV file as df

Now only English songs exist in our dataset.

In [11]:
df = pd.read_csv("./lyrics_new.csv", encoding="utf-8").drop('index.1', axis=1)
df.genre.value_counts()
Out[11]:
Rock             102619
Pop               34919
Hip-Hop           23042
Metal             22249
Not Available     18654
Country           14307
Jazz               7498
Electronic         7374
Other              3951
R&B                3362
Indie              3010
Folk               1878
Name: genre, dtype: int64

Resampling df --> df_sample

300k records easily run out of memory, so we resample the dataset and take an equal-size sample of each genre.

In [12]:
grouped = df.groupby('genre')
df_sample = grouped.apply(lambda x: x.sample(n=1800, random_state=7))

print("Size of dataframe: {}".format(df_sample.shape[0]))
      
df_sample.genre.value_counts()
Size of dataframe: 21600
Out[12]:
Country          1800
Indie            1800
Hip-Hop          1800
Rock             1800
Folk             1800
Electronic       1800
Pop              1800
Other            1800
Metal            1800
Jazz             1800
Not Available    1800
R&B              1800
Name: genre, dtype: int64
In [13]:
# reset_index replaces the old index with a default one
# (drop=True discards it instead of keeping it as a column)
df_sample.reset_index(drop=True, inplace=True)
df_sample.head(10)
Out[13]:
index song year artist genre lyrics
0 104901 it-s-great-to-be-single-again 2007 david-allan-coe Country No more dirty dishes in the sink when I come h...
1 216767 how-can-you-buy-killarney 2007 charlie-landsborough Country An American landed on Erin's green isle\nHe ga...
2 126582 sawing-on-the-strings 2007 alison-krauss Country Way back in the mountains\nWay back in the hil...
3 129927 i-don-t-believe-you-ve-met-my-baby 2006 dolly-parton Country Last night, my tears they were fallin'\nI went...
4 218507 please-don-t-hurry-your-heart 2008 caitlin-cary Country Oh, when you're leaving for the hundredth time...
5 80092 new-dug-grave 2007 gillian-welch Country I left home when I was twenty\nJust to see wha...
6 310224 no-memories-hangin-round 2015 bobby-bare Country You don't want no more heartaches\nAnd I don't...
7 68325 that-s-how-much-i-love-you 2014 eddy-arnold Country Well if I had a nickel I know what I would do\...
8 215612 i-m-fine-either-way 2007 bobby-pinson Country Come on\nMouth full of blood one eye swoll shu...
9 191238 c-mon 2014 amber-hayes Country Hey, hey I'm lookin' at you\nBoy I gotta tell ...

Check the lyrics' quality

In [14]:
# check lyrics with length less than 100
less_than_100 = 0
for row in df_sample.index[:1000]:
    if len(df_sample.loc[row]['lyrics'])<=100:
        print(df_sample.loc[row]['lyrics'])
        less_than_100 += 1
print("\nNum of lyrics with length less than 100 in first 1000: {}".format(less_than_100))
instrumental
This track is an instrumental and has no lyrics.
guitars and cadilacs
hillbilly music
only thing that keeps me hanging on
instrumental
INSTRUMENTAL

Num of lyrics with length less than 100 in first 1000: 5

It looks like many songs don't have meaningful lyrics (instrumental tracks, or something went wrong during crawling).

So we drop all records whose lyrics are shorter than 100 characters.

df_sample --> df_clean

In [15]:
print("Deleting records with lyric length < 100")

len_before = df_sample.shape[0]

df_clean = df_sample.copy()

for row in df_clean.index:
    if len(df_clean.loc[row]['lyrics']) <= 100:
        df_clean.drop(row, inplace=True)

len_after = df_clean.shape[0]

print("Before: {}\nAfter : {}\nDeleted: {}".format(len_before, len_after, len_before-len_after))
Deleting records with lyric length < 100
Before: 21600
After : 20954
Deleted: 646
In [16]:
df_clean.genre.value_counts()
Out[16]:
Country          1791
R&B              1788
Other            1783
Pop              1779
Hip-Hop          1771
Indie            1768
Jazz             1758
Rock             1756
Metal            1723
Not Available    1694
Electronic       1686
Folk             1657
Name: genre, dtype: int64

Transfer the lyrics to a list

df_clean --> x & y

In [17]:
x = df_clean['lyrics'].values
y = df_clean['genre'].values
print('Size of x: {}\nSize of y: {}'.format(x.size, y.size))

x = x.tolist()

x[1]
Size of x: 20954
Size of y: 20954
Out[17]:
"An American landed on Erin's green isle\nHe gazed on killarny with a rapturous smile\nHow can I buy it he said to the guy\nI'll tell you how with a smile he replied\nHow can you buy all the stars in the sky\nHow can you buy two Blue Irish eyes\nWhen you can purchase a fine mothers heart\nThen you can buy killarny\nNature restore on her guilt's with a smile\nMe and Rose the shamrock and the barley\nWhen you can buy all those wonderful things\nThen you can buy killarny\nOver in Killarny, Many years ago,\nthere's a song my mother sang to me\nin a voice so sweet and low.\nJust a simple Irish ditty,\nIn her sweet ould fashion way,\nAnd I'd give the world if I could hear\nThat song of hers today.\nToo-ra-loo-ra-loo-ral,\nToo-ra-loo-ra-li,\nToo-ra-loo-ra-loo-ral,\nHush, now don't you cry!\nToo-ra-loo-ra-loo-ral,\nToo-ra-loo-ra-li,\nToo-ra-loo-ra-loo-ral,\nThat's an Irish lullaby."
In [18]:
# def count_sentence_len(lyric):
#     """count average sentence len for a lyric"""
#     sents_list = lyric.split('\n')
#     avg_len = sum(len(x.split()) for x in sents_list) / len(sents_list)
#     return avg_len

# sentence_length_avg = []

x_clean = []

translator = str.maketrans('', '', string.punctuation)
for l in x:
    l = l.translate(translator)
#     sentence_len = count_sentence_len(l)
#     sentence_length_avg.append(sentence_len)
    l = l.replace('\n', ' ')
    
    x_clean.append(l)
In [19]:
# randomly print 5 lyrics
import random
for i in random.sample(range(len(x_clean)), 5):
    print(x_clean[i])
    print("=============================")
Hook On my wrist bout 40 My chain and my neck 15 This Tec with me hold about 50 Blow his block like whatcha say about little key Comin through very foul with a referee The only thing on my mind is money Thats why I dont know nothing when you ask me something Been smoking on dope in the fastest car Bought three chains from Johnny Dang And coulda bought an Aston Martin Verse 1 I got weed I dont need bitches I got money I dont need friends Me and B friends Ben Franklin thats my best friend we pimpin Blocks fuck with them make us put the heat in ya With the ops Nigga I can see the G in ya Come in my house We got dope all up in the weed vender Come too our block We gone make your dumb ass bleed nigga Flexin on niggas like a sucka  Roll up catch an opp then we gone put racks on a nigga Then shoot up the rest of the niggas They gone get the best of us nigga Hook Verse 2 Car go skurtskurt Tec go clickclak Tuhtuhtuhtuhtuhtuh Letta op ride down the block like its good good We gone tuhtuh his windows And my Glock on me case a nigga think its macaroni Smoking on a big blunt of Tooka She wasnt on me when I was in the hood hood Now she say she got that good good Pistols beat like KRK We coming through day by day We dont want no peace treaties We been pouring  thing all day  Made me a peace treaty and we pulling off having a race Walked in the bank with a smirk Walked out laughing away Im rich now I can buy  or  bitches now My niggas in the field still Lam fuck niggas died Fuck niggas dont like how Im living now
=============================
Verse 1 Intimate scenes fresh from my dreams Of a triple X movie scene I could care less about being seen A higher selfesteem if you on me Aint that the way its supposed to be I say baby do your thing Sugars so sweet shouldve rot my teeth But instead it just rottens me yeah Spoiled crazy Hook Public display of affection Got em hatin Pointin in our direction Watch em watch em Public display of affection Makes them wish they had it this way We be at the club the restaraunt The grocery store or the movie Kissin and touchin with my hands all over ya booty Wherever it is yes Ill love it truly Your PDAAA your PDAAA Your PDAAA your PDAAA Verse 2 Remember at the beach we brought the sheets And were harassed by police Good thing we didnt go too deep Everywhere we meet the passion in me just screams I just need you in my reach Baby your suspense is like being intense Has got me convinced Youre the fingers to my instrument Hook Verse 3 From the lobby to the patio Boy youre so crazy And we so compatible From the Starbucks to the Navajo You nasty And we so compatible Break The club the restaraunt The grocery store or the movie Kissin and touchin with my hands all over ya booty Wherever it is yes Ill love it truly Your PDAAA your PDAAA Your PDAAA your PDAAA Outro Your PDAAA I want your PDA
=============================
If ever you need warm fan affection love and protection turn to me If ever you want someone whos tender darling remember turn to me Whenever you find the lifes got you down Ill be around Any time night or day Im only your phone call away I waited for you a half of my lifetime but Ill wait a lifetime If thats how it must be till you turn to me If ever you need warm fan affection love and protection turn to me If ever you want someone whos tender darling remember turn to me Whenever you find the lifes got you down
=============================
Bless me  Bless me Oh Lord Bless me indeed Enlarge My territory Oh Lord Bless me indeed I pray for increase Bless me indeed I pray for increase Increase Increase Oh Lord Bless me indeed enlarge my territory Oh Lord bless me indeed  I pray for increase Bless me indeed I pray for increase Spoken Keep your hand upon me Keep your hand upon me That no evil can not harm me Sunshine and rain Sickness and pain Lord I humbly come to you Enlarge my territory Enlarge my terrirory Oh Lord Bless me indeed I pray for increase
=============================
By dean friedman We squabble and fight we turn out the light We toss and turn all through the night But come the dawn our arms are wrapped around each other tight I know I hurt you babe I didnt mean to make you so upset I love you more than I can say But sometimes I forget Sometimes I forget And when were apart theres an ache in my heart Its been that way right from the start Still there are times I neglect to do my part You know I still remember the first moment that we met I love you more and more each day But sometimes I forget Sometimes I forget I wish I understood how it can feel so good A love that is so sublime But somehow it just slips my mind So let me say this our love is bliss Id risk a fortune just for your kiss And even so I know I have been remiss So all your fears let me allay You know you have no cause to fret My love will never go away Its just sometimes I forget Sometimes I forget
=============================
In [20]:
print(len(x_clean))
20954

2.2 Removing stop words

The nltk package has a built-in list of stop words. Here I build my own stop-word list based on scikit-learn's built-in stop-word list.

In [21]:
%%time
x_clean = [x.lower() for x in x_clean]

x_clean_new = []
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
stop_words = list(ENGLISH_STOP_WORDS)
stop_words = stop_words + ['will', 'got', 'ill', 'im', 'let']

for text in x_clean:
    text = ' '.join([word for word in text.split() if word not in stop_words])
    x_clean_new.append(text)
    
x_clean = x_clean_new
CPU times: user 18 s, sys: 81.1 ms, total: 18.1 s
Wall time: 18.3 s
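Much of that 18 s is likely the `word not in stop_words` test against a list, which is O(n) per word; converting the stop list to a set makes each lookup O(1). A sketch with a tiny stand-in stop list:

```python
# A set instead of a list makes membership tests O(1).
stop_words = {'the', 'a', 'is', 'will', 'got'}   # stand-in stop list

text = 'the night is young and the city is ours'
cleaned = ' '.join(w for w in text.split() if w not in stop_words)
print(cleaned)  # 'night young and city ours'
```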

2.3 Bag-of-words representation

In [22]:
with open('./ospd.txt', encoding='utf-8', errors='ignore') as f1:
    vocab1 = f1.read().split("\n")

print(len(vocab1))
79340
In [23]:
from sklearn.feature_extraction.text import CountVectorizer



# CountVectorizer automatically lowercases words
cv = CountVectorizer(stop_words='english',
                    encoding='utf-8',
                    lowercase=True,
                    vocabulary=vocab1)

bag_words = cv.fit_transform(x_clean)

print('Shape of bag words: {}'.format(bag_words.shape))
print("Length of Vocabulary: {}".format(len(cv.vocabulary_)))
Shape of bag words: (20954, 79340)
Length of Vocabulary: 79340

Let's create a pandas DataFrame containing the bag-of-words (BoW) model.

In [24]:
df_bow = pd.DataFrame(data=bag_words.toarray(),columns=cv.get_feature_names())
df_bow.head()
Out[24]:
aa aah aahed aahing aahs aal aalii aaliis aals aardvark aardwolf aargh aarrgh aarrghh aas aasvogel aba abaca abacas abaci aback abacus abacuses abaft abaka abakas abalone abalones abamp abampere ... zygoid zygoma zygomas zygomata zygose zygoses zygosis zygosity zygote zygotene zygotes zygotic zymase zymases zyme zymes zymogen zymogene zymogens zymogram zymology zymosan zymosans zymoses zymosis zymotic zymurgy zyzzyva zyzzyvas
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5 rows × 79340 columns

In [25]:
%%time
word_freq = df_bow.sum().sort_values(ascending=False)
CPU times: user 42.8 s, sys: 8.83 s, total: 51.6 s
Wall time: 55 s
In [26]:
word_freq[:30]
Out[26]:
love      31574
like      27405
know      26740
just      24935
oh        19478
time      15466
baby      13818
want      13161
come      12529
cause     12482
way       12077
say       11882
make      11646
yeah      11150
life       9547
heart      9446
right      9295
feel       9116
away       9067
need       8847
day        8622
night      8189
tell       8186
man        8107
girl       7367
world      7097
good       6845
think      6827
theres     6812
little     6764
dtype: int64

2.4 Tf-idf representation

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(stop_words='english',
                             encoding='utf-8',
                             lowercase=True,
                             vocabulary=vocab1)

tfidf_mat = tfidf_vect.fit_transform(x_clean)

print('Shape of tf-idf matrix: {}'.format(tfidf_mat.shape))
print("Length of Vocabulary: {}".format(len(tfidf_vect.vocabulary_)))
Shape of tf-idf matrix: (20954, 79340)
Length of Vocabulary: 79340
In [28]:
df_tfidf = pd.DataFrame(data=tfidf_mat.toarray(),columns=tfidf_vect.get_feature_names())
df_tfidf
Out[28]:
aa aah aahed aahing aahs aal aalii aaliis aals aardvark aardwolf aargh aarrgh aarrghh aas aasvogel aba abaca abacas abaci aback abacus abacuses abaft abaka abakas abalone abalones abamp abampere ... zygoid zygoma zygomas zygomata zygose zygoses zygosis zygosity zygote zygotene zygotes zygotic zymase zymases zyme zymes zymogen zymogene zymogens zymogram zymology zymosan zymosans zymoses zymosis zymotic zymurgy zyzzyva zyzzyvas
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
11 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
12 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
13 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
15 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
16 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
17 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
18 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
19 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
20 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
21 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
22 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
23 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
24 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
25 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
26 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
27 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
28 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
29 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

20954 rows × 79340 columns
20944 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
20945 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
20946 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
20947 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
20948 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
20949 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
20950 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
20951 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
20952 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
20953 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

20954 rows × 79340 columns

In [29]:
%%time
word_score = df_tfidf.sum().sort_values(ascending=False)
CPU times: user 54.3 s, sys: 9.83 s, total: 1min 4s
Wall time: 1min 8s
In [30]:
word_score[:30]
Out[30]:
love      907.544198
know      696.584566
like      637.814623
just      635.986998
oh        572.149699
time      485.569847
baby      479.713803
want      451.735145
way       417.435273
come      410.556236
say       399.953074
heart     386.085850
cause     376.682279
make      376.064865
away      357.539640
life      357.042271
feel      349.504375
yeah      343.651757
day       329.002055
need      325.014459
right     319.286334
night     316.511432
tell      307.892378
chorus    296.679792
world     291.042023
theres    289.756255
man       280.639421
wont      275.455763
think     273.592617
girl      271.562438
dtype: float64

We can also compute the correlation matrix, where the entry at position (i, j) represents the correlation between song i and song j.

In [31]:
corr = (tfidf_mat * tfidf_mat.T).A
In [32]:
corr.shape
Out[32]:
(20954, 20954)
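The computation above can be sketched on a toy corpus. This is a minimal sketch assuming scikit-learn's TfidfVectorizer (which L2-normalizes rows by default), so the product of the tf-idf matrix with its transpose yields cosine similarities between songs; the example documents are hypothetical stand-ins for the cleaned lyrics.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the cleaned lyrics (hypothetical documents).
docs = ["love you baby love",
        "night drive city lights",
        "love night baby"]

vec = TfidfVectorizer()
mat = vec.fit_transform(docs)   # rows are L2-normalized by default

# Dot products of unit-length rows are cosine similarities.
corr = (mat * mat.T).toarray()
print(corr.shape)               # one row/column per song
```

Because each row has unit length, the diagonal of `corr` is 1 and, with non-negative tf-idf weights, the off-diagonal entries lie in [0, 1].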

3. Data Visualization

3.1 Summary

In [33]:
df_clean.head()
Out[33]:
index song year artist genre lyrics
0 104901 it-s-great-to-be-single-again 2007 david-allan-coe Country No more dirty dishes in the sink when I come h...
1 216767 how-can-you-buy-killarney 2007 charlie-landsborough Country An American landed on Erin's green isle\nHe ga...
2 126582 sawing-on-the-strings 2007 alison-krauss Country Way back in the mountains\nWay back in the hil...
3 129927 i-don-t-believe-you-ve-met-my-baby 2006 dolly-parton Country Last night, my tears they were fallin'\nI went...
4 218507 please-don-t-hurry-your-heart 2008 caitlin-cary Country Oh, when you're leaving for the hundredth time...
In [37]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline


plt.style.use('ggplot')
freq = pd.DataFrame(word_freq, columns = ['frequency'])
fig = freq[:20].plot(kind = 'barh', figsize=(9,8), fontsize=18)
# plt.legend('number of occurrences', loc = 'upper right')



plt.gca().invert_yaxis()
plt.title('words frequencies', fontsize=20)
Out[37]:
<matplotlib.text.Text at 0x118177080>

As we can see in this histogram, the most frequent words are "love", "know", "like", and so on. Among the top 20 words listed, the frequency of the top 4 (love, know, like, just) is almost triple that of the last 3; that is, there is a considerable gap between the frequencies of different words. One more thing we notice is that the list contains an interjection, "oh", ranked as the sixth most frequent word. We didn't even realize artists used "oh" so often in their lyrics!

In [71]:
score = pd.DataFrame(word_score, columns = ['Score'])
ax = score[:20].plot(kind = 'barh', figsize=(9,8), fontsize=18)
plt.legend('score', loc = 'lower right', fontsize=15)
plt.gca().invert_yaxis()
plt.title('tf-idf score')
Out[71]:
<matplotlib.text.Text at 0x1229d75f8>

To figure out the most representative words for each genre, TF-IDF may be more appropriate, since TF-IDF reflects how important a word is to a document. From the plot above, we can see that the top-scoring words are quite different from those ranked by raw term frequency.

We can also see that some words, like "al", "bo", "dor", and "la", have high TF-IDF scores. This may be due to the fact that these words appear in only a few documents (songs), which makes them "special" and highlights them as important words for those documents.

A TF-IDF analysis for each genre is still needed.

In [39]:
# code example from https://www.kaggle.com/carrie1/drug-of-choice-by-genre-using-song-lyrics
df_clean['word_count'] = df_clean['lyrics'].str.split().str.len()
df_clean.info()
f, ax = plt.subplots(figsize=(10, 9))
sns.violinplot(x = df_clean.word_count)
plt.xlim(-100, 1000)
plt.title('Word count distribution', fontsize=26)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20954 entries, 0 to 21599
Data columns (total 7 columns):
index         20954 non-null int64
song          20954 non-null object
year          20954 non-null int64
artist        20954 non-null object
genre         20954 non-null object
lyrics        20954 non-null object
word_count    20954 non-null int64
dtypes: int64(3), object(4)
memory usage: 1.3+ MB
Out[39]:
<matplotlib.text.Text at 0x133ef79e8>

The violin plot shows the distribution of songs by the number of words in their lyrics.

The figure shows that most songs have lyrics of 100 to 300 words, the median lies around 200, and only a small fraction of songs have lyrics longer than 400 words.

This makes sense for real lyrics. After all, people may get tired of songs with too many words, and are also unlikely to fall in love with songs that have only a few.

The plot above covers all lyrics without separating them by genre, so it does not yet reveal the per-genre features we are after.

In [40]:
f, ax = plt.subplots(figsize=(10, 9))
sns.boxplot(x = "genre", y = "word_count", data = df_clean, palette = "Set1")
plt.ylim(1,2000)
Out[40]:
(1, 2000)

To examine the lyric-length feature for each genre, we group the data by genre and draw a box plot per genre.

According to the plot, the medians of most boxes are under 250 (around 200). Only the median for Hip-Hop is around 500, more than double that of the others. As for the maxima, Electronic, Rock, and Hip-Hop have the three longest lyrics, while there is no big difference in the minima across genres.

In general, the 5 genres with the longest lyrics are Hip-Hop, Pop, R&B, Electronic, and Indie; the 3 shortest are Jazz, Metal, and Country. It seems that up-tempo genres are more likely to have longer lyrics, and vice versa, though there are exceptions: Metal songs are up-tempo, yet mostly have shorter lyrics than other up-tempo genres. Thus, lyric length can serve as a reference for genre classification, but should not be the deciding metric.
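The box-plot reading above can also be backed with numbers via a groupby aggregation; this is a small sketch on hypothetical rows standing in for df_clean's 'genre' and 'word_count' columns:

```python
import pandas as pd

# Hypothetical rows mirroring df_clean's 'genre' and 'word_count' columns.
df = pd.DataFrame({
    'genre': ['Hip-Hop', 'Hip-Hop', 'Country', 'Country', 'Jazz'],
    'word_count': [520, 480, 210, 190, 160],
})

# Median lyric length per genre, longest first, mirroring the box plot.
medians = (df.groupby('genre')['word_count']
             .median()
             .sort_values(ascending=False))
print(medians)
```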

3.2 Distribution across time

In [42]:
mpl.rc("figure", figsize=(12,12))
sns.violinplot(x='genre', y='year', data=df_clean)
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x133d79908>

The distribution looks skewed by extreme values, so let's check for outliers.

In [46]:
df_clean[df_clean['year'] <= 2000].shape[0]
Out[46]:
0

Drop songs from 2000 or earlier and plot again.

In [47]:
# boolean filtering replaces the row-by-row drop loop
df_clean = df_clean[df_clean['year'] > 2000]
In [48]:
mpl.rc("figure", figsize=(15, 25))
sns.violinplot(x='year', y='genre', data=df_clean, inner="quartile")
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x1221d45c0>

We can see that the distributions are quite different. Country, Metal, Pop, R&B, and Rock have more centralized distributions, with most songs created during 2005~2010. The other genres have quite stretched distributions. "Other" songs (songs not labeled with a genre) were mostly composed after 2012, probably because new songs have not been labeled yet.

Several genres had a big bang around 2006~2009. We wonder whether this distribution reflects reality or is just a crawling artifact.

3.3 Top artists

In [49]:
top_artist = df.artist.value_counts().head(8).index.tolist()

# df_clean['artist'].isin(top_artist)
# df_clean.loc[df_clean['artist'] in]

df_top_artist = df_clean.loc[df_clean['artist'].isin(top_artist), :]
df_top_artist.head()
Out[49]:
index song year artist genre lyrics word_count
3 129927 i-don-t-believe-you-ve-met-my-baby 2006 dolly-parton Country Last night, my tears they were fallin'\nI went... 173
7 68325 that-s-how-much-i-love-you 2014 eddy-arnold Country Well if I had a nickel I know what I would do\... 243
14 129783 two-little-orphans 2006 dolly-parton Country Two little children a boy and a girl\nSat by a... 155
35 129690 let-her-fly 2006 dolly-parton Country There's a wreath on the door\nShe don't live h... 170
43 129831 i-walk-the-line 2006 dolly-parton Country (Johnny Cash)\nI keep a close watch on this he... 254
In [50]:
df_top_artist.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 329 entries, 3 to 21576
Data columns (total 7 columns):
index         329 non-null int64
song          329 non-null object
year          329 non-null int64
artist        329 non-null object
genre         329 non-null object
lyrics        329 non-null object
word_count    329 non-null int64
dtypes: int64(3), object(4)
memory usage: 20.6+ KB
In [52]:
mpl.rc("figure", figsize=(25, 15))
sns.violinplot(x='artist', y='year', data=df_top_artist, inner="quartile")
sns.set(font_scale=3)

For the top 8 artists, we plot this figure to explore their most productive years. For eddy-arnold, dolly-parton, eminem, barba-streisan, and bee-gees, most of their songs were composed during 2005~2010. cris-crown and bob-dylan, on the other hand, seem to have kept creating over a long period. However, bob-dylan's works appear to be somewhat "ahead of time", which may be due to incorrectly entered metadata.

In [53]:
print(df_bow.shape)
print(len(y))
(20954, 79340)
20954

3.4 Length of songs

In [54]:
df_bow['length'] = df_bow.sum(axis=1)
In [55]:
# create two new columns: 
# @ length: length of documents basing on bag-of-word model
# @ genre: genre of the record

df_bow['genre'] = pd.Series(y).values
In [56]:
mpl.rc("figure", figsize=(25, 15))
sns.violinplot(x='length', y='genre', data=df_bow, inner="quartile")
sns.set(font_scale=3)

This is another way to calculate lyric length, based on the bag-of-words model. The resulting violin plot of lyric length per genre corresponds to the box plot above.

Next we want to check the top 10 frequent words of each genre.

In [57]:
genre_count = df_bow.groupby('genre').sum()
genre_count.drop('length', axis=1, inplace=True)
genre_count.head()
Out[57]:
aa aah aahed aahing aahs aal aalii aaliis aals aardvark aardwolf aargh aarrgh aarrghh aas aasvogel aba abaca abacas abaci aback abacus abacuses abaft abaka abakas abalone abalones abamp abampere ... zygoid zygoma zygomas zygomata zygose zygoses zygosis zygosity zygote zygotene zygotes zygotic zymase zymases zyme zymes zymogen zymogene zymogens zymogram zymology zymosan zymosans zymoses zymosis zymotic zymurgy zyzzyva zyzzyvas
genre
Country 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 2 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Electronic 0 13 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Folk 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Hip-Hop 1 86 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Indie 0 25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5 rows × 79338 columns

In [58]:
genre_count_new = genre_count.transpose()
genre_list = df_clean.genre.unique().tolist()
In [62]:
for genre in genre_list:
    t = genre_count_new.nlargest(10, genre, keep='first')[genre]
    
    fig = plt.figure(figsize=(6,4))
    fig.suptitle(genre, fontsize=20)
    plt.xticks(rotation='vertical')
    sns.barplot(t.values, t.index, alpha=0.8)
sns.set(font_scale=3)

In the histograms above, we list the top frequent words for each genre. Different genres share many of their top-10 frequent words, and this information is visualized as word clouds in Part 4.

From these histograms, it is pretty clear that 'love' is something almost every genre cares about. Other shared words include 'know', 'time', 'oh', etc., and many of them are verbs. Hip-Hop, however, has a quite different set of frequent words, distinctive from the other genres.
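That impression (shared top words overall, but a distinctive Hip-Hop vocabulary) can be quantified by intersecting the top-k word sets of two genres. A minimal sketch with hypothetical counts standing in for genre_count_new:

```python
import pandas as pd

# Hypothetical per-genre word counts standing in for genre_count_new:
# rows are words, columns are genres.
counts = pd.DataFrame({
    'Pop':     {'love': 90, 'know': 70, 'time': 50, 'baby': 40},
    'Hip-Hop': {'love': 30, 'money': 80, 'cash': 60, 'know': 20},
})

k = 2  # top-k words per genre (the notebook uses top-10)
tops = {g: set(counts[g].nlargest(k).index) for g in counts.columns}

# A small intersection means the genres' signature vocabularies differ.
overlap = tops['Pop'] & tops['Hip-Hop']
print(tops, overlap)
```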

4. Word Cloud

Now it is 'wordcloud' time. A word cloud is a visual representation of text data and a very efficient way to display word frequencies.

First, let's draw the overall word cloud based on term frequency.

In [63]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline
plt.style.use('ggplot')

# join all cleaned lyrics into one string (faster than += in a loop)
all_lyrics = ' '.join(x_clean)
In [64]:
# code example from https://amueller.github.io/word_cloud/index.html
wordcloud = WordCloud(max_font_size=60).generate(all_lyrics)
import matplotlib.pyplot as plt
plt.figure(figsize=(15,15))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
Out[64]:
(-0.5, 399.5, 199.5, -0.5)

We can clearly see that the most frequently used word overall is 'love', followed by words such as 'got'.

In [65]:
word_freq[:30]
Out[65]:
love      31574
like      27405
know      26740
just      24935
oh        19478
time      15466
baby      13818
want      13161
come      12529
cause     12482
way       12077
say       11882
make      11646
yeah      11150
life       9547
heart      9446
right      9295
feel       9116
away       9067
need       8847
day        8622
night      8189
tell       8186
man        8107
girl       7367
world      7097
good       6845
think      6827
theres     6812
little     6764
dtype: int64

As we can see, the word cloud presents word frequencies in a visual, intuitive way.

Let's try plot word clouds in different genres.

In [66]:
d = {'genre': y.tolist(), "lyric": x_clean}
df_plot = pd.DataFrame(d)
df_plot.head(10)
Out[66]:
genre lyric
0 Country dirty dishes sink come home dont worry spend n...
1 Country american landed erins green isle gazed killarn...
2 Country way mountains way hills used live mountaineer ...
3 Country night tears fallin went bed sad blue dream dre...
4 Country oh youre leaving hundredth time day looking re...
5 Country left home just promise mother return christmas...
6 Country dont want heartaches dont want teardrops love ...
7 Country nickel know id spend candy id spend candy caus...
8 Country come mouth blood eye swoll shut mustered punch...
9 Country hey hey lookin boy gotta tell liking view scuf...

Now let's separate those lyrics into different genres.

In [67]:
# concatenate each genre's lyrics into one string, keyed by genre
# (groupby + join avoids quadratic string concatenation)
lyrics = df_plot.groupby('genre')['lyric'].apply(' '.join).to_dict()
In [68]:
for genre, lyric in lyrics.items():
    wordcloud = WordCloud(max_font_size=60).generate(lyric)
    
    fig = plt.figure(figsize=(10,8))
    fig.suptitle(genre, fontsize=24)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.tight_layout()

In those word clouds, 'love' is close to the most frequent word in every genre, and words such as 'life', 'know', and 'time' are also used frequently. But there are differences among the genres as well. For example, in Jazz the word 'heart' is used more than in other genres, and Hip-Hop contains more dirty words, which makes sense. After exploring all the lyrics, we can conclude that most lyrics share some common words, but each genre also has its own distinctive vocabulary. Based on these results, we can build a genre predictor in the future.
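As a first step toward that future prediction, a tf-idf + multinomial Naive Bayes pipeline is one plausible baseline; this sketch uses toy lyrics and labels standing in for x_clean and y, and is not the method evaluated in this lab:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical mini training set (stand-ins for x_clean and y).
lyrics = ['love heart night love', 'money cash street money',
          'heart love baby', 'cash street hustle']
genres = ['Pop', 'Hip-Hop', 'Pop', 'Hip-Hop']

# tf-idf features feeding a multinomial Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(lyrics, genres)

pred = model.predict(['love baby heart'])
print(pred[0])
```

On the real dataset, the same pipeline would be fit on a train split and scored on a held-out split against the accuracy target set in Part 1.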